GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 10 - Model Selection And Boosting/XGBoost/[Python] XGBoost.ipynb
Kernel: Python 3

XGBoost

Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
plt.rcParams['figure.figsize'] = [14, 8]
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
dataset.head(1)
X = dataset.iloc[:, [3, 4, 6, 7, 8, 9, 10, 11, 12]].values
y = dataset.iloc[:, 13].values
X[0]
array([619, 'France', 42, 2, 0.0, 1, 1, 1, 101348.88], dtype=object)
y[0]
1
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])  # encode the Geography column as integers
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]  # drop one dummy column to avoid the dummy variable trap
X[0:2]
array([[ 0.00000000e+00, 0.00000000e+00, 6.19000000e+02, 4.20000000e+01, 2.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.01348880e+05], [ 0.00000000e+00, 1.00000000e+00, 6.08000000e+02, 4.10000000e+01, 1.00000000e+00, 8.38078600e+04, 1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.12542580e+05]])
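Note: the `categorical_features` argument of `OneHotEncoder` has been removed in newer scikit-learn releases, so the encoding cell above only runs on the older version used for this notebook. On a recent scikit-learn, a rough equivalent (a sketch, not part of the original notebook) replaces the LabelEncoder/OneHotEncoder pair with a ColumnTransformer applied to the raw Geography column:

# Sketch for newer scikit-learn (>= 0.20): one-hot encode column 1 (Geography)
# with a ColumnTransformer instead of the removed `categorical_features` argument.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([('geo', OneHotEncoder(), [1])], remainder='passthrough')
X = ct.fit_transform(X)
X = X[:, 1:]  # drop one dummy column to avoid the dummy variable trap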
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Fitting XGBoost to the training set

from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True, subsample=1)
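The classifier above is trained with all default hyperparameters. With the XGBoost version used here you can also hold out part of the training data and stop boosting once the validation log loss stops improving, via the `eval_set`, `eval_metric` and `early_stopping_rounds` arguments of `fit` documented in the help output further below. A minimal sketch (the extra validation split is not part of the original notebook):

# Optional sketch: carve a validation set out of the training data and stop
# boosting early once validation log loss has not improved for 10 rounds.
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
classifier_es = XGBClassifier(n_estimators=500)
classifier_es.fit(X_tr, y_tr,
                  eval_set=[(X_val, y_val)],
                  eval_metric='logloss',
                  early_stopping_rounds=10,
                  verbose=False)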

Predicting the Test set results

y_pred = classifier.predict(X_test)
y_pred[0:10]
array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1])
y_test[0:10]
array([0, 1, 0, 0, 0, 1, 0, 0, 1, 1])

Making the Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
array([[1532, 63], [ 203, 202]])
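The matrix shows noticeably more false negatives (203) than false positives (63), so overall accuracy hides how well actual churners are identified. A per-class precision/recall summary makes this visible (a small sketch, not in the original notebook):

# Sketch: per-class precision, recall and F1 for the same predictions.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))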

Calculating Accuracy

(cm[0][0]+cm[1][1])/np.sum(cm)
0.86699999999999999
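The same figure can be obtained directly from scikit-learn instead of indexing the confusion matrix by hand:

# Equivalent accuracy computed by scikit-learn.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)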

Applying k-Fold Cross Validation

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies # accuracy on each of the 10 folds
array([ 0.87640449, 0.8639201 , 0.88125 , 0.86625 , 0.86375 , 0.855 , 0.865 , 0.8575 , 0.8485607 , 0.87359199])
np.mean(accuracies) # mean of accuracies
0.86512272851207572
np.std(accuracies) # standard deviation of accuracies
0.0094793902817781814
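A compact way to report the cross-validation result is the mean accuracy together with the standard deviation across folds, for example:

# Summarise the 10-fold results in one line.
print("CV accuracy: {:.4f} (+/- {:.4f})".format(accuracies.mean(), accuracies.std()))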

Applying Grid Search to find the best model and the best parameters (Optional)

from sklearn.model_selection import GridSearchCV
help(XGBClassifier())
Help on XGBClassifier in module xgboost.sklearn object: class XGBClassifier(XGBModel, sklearn.base.ClassifierMixin) | Implementation of the scikit-learn API for XGBoost classification. | | Parameters | ---------- | max_depth : int | Maximum tree depth for base learners. | learning_rate : float | Boosting learning rate (xgb's "eta") | n_estimators : int | Number of boosted trees to fit. | silent : boolean | Whether to print messages while running boosting. | objective : string or callable | Specify the learning task and the corresponding learning objective or | a custom objective function to be used (see note below). | booster: string | Specify which booster to use: gbtree, gblinear or dart. | nthread : int | Number of parallel threads used to run xgboost. (Deprecated, please use n_jobs) | n_jobs : int | Number of parallel threads used to run xgboost. (replaces nthread) | gamma : float | Minimum loss reduction required to make a further partition on a leaf node of the tree. | min_child_weight : int | Minimum sum of instance weight(hessian) needed in a child. | max_delta_step : int | Maximum delta step we allow each tree's weight estimation to be. | subsample : float | Subsample ratio of the training instance. | colsample_bytree : float | Subsample ratio of columns when constructing each tree. | colsample_bylevel : float | Subsample ratio of columns for each split, in each level. | reg_alpha : float (xgb's alpha) | L1 regularization term on weights | reg_lambda : float (xgb's lambda) | L2 regularization term on weights | scale_pos_weight : float | Balancing of positive and negative weights. | base_score: | The initial prediction score of all instances, global bias. | seed : int | Random number seed. (Deprecated, please use random_state) | random_state : int | Random number seed. (replaces seed) | missing : float, optional | Value in the data which needs to be present as a missing value. If | None, defaults to np.nan. | **kwargs : dict, optional | Keyword arguments for XGBoost Booster object. Full documentation of parameters can | be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md. | Attempting to set a parameter via the constructor args and **kwargs dict simultaneously | will result in a TypeError. | Note: | **kwargs is unsupported by Sklearn. We do not guarantee that parameters passed via | this argument will interact properly with Sklearn. | | Note | ---- | A custom objective function can be provided for the ``objective`` | parameter. In this case, it should have the signature | ``objective(y_true, y_pred) -> grad, hess``: | | y_true: array_like of shape [n_samples] | The target values | y_pred: array_like of shape [n_samples] | The predicted values | | grad: array_like of shape [n_samples] | The value of the gradient for each sample point. | hess: array_like of shape [n_samples] | The value of the second derivative for each sample point | | Method resolution order: | XGBClassifier | XGBModel | sklearn.base.BaseEstimator | sklearn.base.ClassifierMixin | builtins.object | | Methods defined here: | | __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs) | Initialize self. See help(type(self)) for accurate signature. 
| | evals_result(self) | Return the evaluation results. | | If eval_set is passed to the `fit` function, you can call evals_result() to | get evaluation results for all passed eval_sets. When eval_metric is also | passed to the `fit` function, the evals_result will contain the eval_metrics | passed to the `fit` function | | Returns | ------- | evals_result : dictionary | | Example | ------- | param_dist = {'objective':'binary:logistic', 'n_estimators':2} | | clf = xgb.XGBClassifier(**param_dist) | | clf.fit(X_train, y_train, | eval_set=[(X_train, y_train), (X_test, y_test)], | eval_metric='logloss', | verbose=True) | | evals_result = clf.evals_result() | | The variable evals_result will contain: | {'validation_0': {'logloss': ['0.604835', '0.531479']}, | 'validation_1': {'logloss': ['0.41965', '0.17686']}} | | fit(self, X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None) | Fit gradient boosting classifier | | Parameters | ---------- | X : array_like | Feature matrix | y : array_like | Labels | sample_weight : array_like | Weight for each instance | eval_set : list, optional | A list of (X, y) pairs to use as a validation set for | early-stopping | eval_metric : str, callable, optional | If a str, should be a built-in evaluation metric to use. See | doc/parameter.md. If callable, a custom evaluation metric. The call | signature is func(y_predicted, y_true) where y_true will be a | DMatrix object such that you may need to call the get_label | method. It must return a str, value pair where the str is a name | for the evaluation and value is the value of the evaluation | function. This objective is always minimized. | early_stopping_rounds : int, optional | Activates early stopping. Validation error needs to decrease at | least every <early_stopping_rounds> round(s) to continue training. | Requires at least one item in evals. If there's more than one, | will use the last. Returns the model from the last iteration | (not the best one). If early stopping occurs, the model will | have three additional fields: bst.best_score, bst.best_iteration | and bst.best_ntree_limit. | (Use bst.best_ntree_limit to get the correct value if num_parallel_tree | and/or num_class appears in the parameters) | verbose : bool | If `verbose` and an evaluation set is used, writes the evaluation | metric measured on the validation set to stderr. | xgb_model : str | file name of stored xgb model or 'Booster' instance Xgb model to be | loaded before training (allows training continuation). | | predict(self, data, output_margin=False, ntree_limit=0) | | predict_proba(self, data, output_margin=False, ntree_limit=0) | | ---------------------------------------------------------------------- | Methods inherited from XGBModel: | | __setstate__(self, state) | | apply(self, X, ntree_limit=0) | Return the predicted leaf every tree for each sample. | | Parameters | ---------- | X : array_like, shape=[n_samples, n_features] | Input features matrix. | | ntree_limit : int | Limit number of trees in the prediction; defaults to 0 (use all trees). | | Returns | ------- | X_leaves : array_like, shape=[n_samples, n_trees] | For each datapoint x in X and for each tree, return the index of the | leaf x ends up in. Leaves are numbered within | ``[0; 2**(self.max_depth+1))``, possibly with gaps in the numbering. | | get_booster(self) | Get the underlying xgboost Booster of this model. 
| | This will raise an exception when fit was not called | | Returns | ------- | booster : a xgboost booster of underlying model | | get_params(self, deep=False) | Get parameters. | | get_xgb_params(self) | Get xgboost type parameters. | | ---------------------------------------------------------------------- | Data descriptors inherited from XGBModel: | | feature_importances_ | Returns | ------- | feature_importances_ : array of shape = [n_features] | | ---------------------------------------------------------------------- | Methods inherited from sklearn.base.BaseEstimator: | | __getstate__(self) | | __repr__(self) | Return repr(self). | | set_params(self, **params) | Set the parameters of this estimator. | | The method works on simple estimators as well as on nested objects | (such as pipelines). The latter have parameters of the form | ``<component>__<parameter>`` so that it's possible to update each | component of a nested object. | | Returns | ------- | self | | ---------------------------------------------------------------------- | Data descriptors inherited from sklearn.base.BaseEstimator: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined) | | ---------------------------------------------------------------------- | Methods inherited from sklearn.base.ClassifierMixin: | | score(self, X, y, sample_weight=None) | Returns the mean accuracy on the given test data and labels. | | In multi-label classification, this is the subset accuracy | which is a harsh metric since you require for each sample that | each label set be correctly predicted. | | Parameters | ---------- | X : array-like, shape = (n_samples, n_features) | Test samples. | | y : array-like, shape = (n_samples) or (n_samples, n_outputs) | True labels for X. | | sample_weight : array-like, shape = [n_samples], optional | Sample weights. | | Returns | ------- | score : float | Mean accuracy of self.predict(X) wrt. y.
# Tried various parameters; this combination is the best so far.
parameters = [{'max_depth': [3],
               'learning_rate': [0.1],
               'n_estimators': [250],
               'booster': ['gbtree', 'gblinear', 'dart']}]
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_accuracy
0.86550000000000005
best_parameters = grid_search.best_params_
best_parameters
{'booster': 'gbtree', 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 250}
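Because `refit=True` is the GridSearchCV default, the best parameter combination has already been retrained on the whole training set and is available as `best_estimator_`. A small sketch evaluating it on the held-out test set (not part of the original notebook):

# Sketch: evaluate the refitted best model on the test set.
from sklearn.metrics import accuracy_score
best_classifier = grid_search.best_estimator_
y_pred_best = best_classifier.predict(X_test)
accuracy_score(y_test, y_pred_best)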